Convert dataset generation to function-based, not class-based#161

Merged
nikhilwoodruff merged 60 commits into main from functions on Aug 6, 2025

@nikhilwoodruff nikhilwoodruff commented Jul 13, 2025

Contributor

This PR modernises the dataset generation architecture by replacing the class-based approach with a simpler function-based system. The key changes include consolidating all dataset creation logic into a single create_datasets.py script, moving from Python 3.12 to 3.13, and simplifying the build pipeline.

The new architecture eliminates the complex class hierarchy for dataset generation in favour of straightforward functions that produce the same output. This makes the codebase easier to maintain and understand whilst preserving all existing functionality. The enhanced FRS dataset generation now follows a linear process: create base FRS, add imputations, uprate to 2025, calibrate with targets, then downrate back to 2023.
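The linear process described above can be sketched as a chain of plain functions, each taking a dataset and returning a new one. This is a minimal illustration only, not the code in create_datasets.py: the `Dataset` container, every function name, the flat 5% growth factor, the 10% savings rule and the single income target are all hypothetical stand-ins for the real imputation, uprating and calibration logic.

```python
from dataclasses import dataclass, replace

# Hypothetical flat annual growth factor, used for both uprating and downrating.
GROWTH_PER_YEAR = 1.05


@dataclass(frozen=True)
class Dataset:
    """Toy stand-in for a survey dataset (not the real FRS structure)."""
    year: int
    incomes: tuple
    savings: tuple
    weights: tuple


def create_base(year):
    # Step 1: build the base dataset (made-up values for two households).
    return Dataset(year=year, incomes=(20_000.0, 40_000.0),
                   savings=(), weights=(1.0, 1.0))


def add_imputations(ds):
    # Step 2: placeholder imputation, assuming savings are 10% of income.
    return replace(ds, savings=tuple(0.1 * x for x in ds.incomes))


def rescale_to_year(ds, year):
    # Steps 3 and 5: move monetary values to another year.
    # A positive year gap uprates, a negative gap downrates.
    factor = GROWTH_PER_YEAR ** (year - ds.year)
    return replace(
        ds,
        year=year,
        incomes=tuple(x * factor for x in ds.incomes),
        savings=tuple(x * factor for x in ds.savings),
    )


def calibrate(ds, target_total_income):
    # Step 4: scale all weights so weighted income matches one target.
    current = sum(w * x for w, x in zip(ds.weights, ds.incomes))
    scale = target_total_income / current
    return replace(ds, weights=tuple(w * scale for w in ds.weights))


def create_enhanced_dataset(base_year=2023, calibration_year=2025,
                            target_total_income=132_300.0):
    # The whole pipeline reads top to bottom, with no class hierarchy.
    ds = create_base(base_year)
    ds = add_imputations(ds)
    ds = rescale_to_year(ds, calibration_year)
    ds = calibrate(ds, target_total_income)
    ds = rescale_to_year(ds, base_year)
    return ds
```

Because each step is a pure function of its input, any stage can be run and inspected on its own, which is one plausible reading of the maintainability benefit the description claims.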

Additional improvements include simplified dependency installation using uv, streamlined CI workflows, and better organisation of local area data files.

@nikhilwoodruff nikhilwoodruff self-assigned this Jul 13, 2025
@nikhilwoodruff nikhilwoodruff marked this pull request as draft July 13, 2025 13:41
@nikhilwoodruff nikhilwoodruff force-pushed the main branch 2 times, most recently from 517f903 to b6b165f Compare July 14, 2025 17:09
@nikhilwoodruff nikhilwoodruff force-pushed the main branch 2 times, most recently from a2e0647 to 39701f5 Compare July 14, 2025 18:46

@anth-volk anth-volk left a comment

Contributor

I really loved reading through this @nikhilwoodruff. There are so many parts of this that are a real upgrade to what we currently do. I did a lighter review, given that this is not my area of expertise and I know you're looking to move fast. I left a few non-blocking nits that you might want to check out, but otherwise, love this trajectory.

Comment thread policyengine_uk_data/datasets/frs.py Outdated
Comment thread policyengine_uk_data/datasets/frs.py
Comment thread policyengine_uk_data/datasets/create_datasets.py Outdated
Comment thread policyengine_uk_data/datasets/frs.py
Comment thread policyengine_uk_data/datasets/frs.py
Comment thread policyengine_uk_data/datasets/frs.py
Comment thread policyengine_uk_data/datasets/imputations/capital_gains.py Outdated
Comment thread policyengine_uk_data/datasets/imputations/capital_gains.py Outdated
Comment thread policyengine_uk_data/datasets/imputations/wealth.py
Comment thread test.ipynb Outdated
@nikhilwoodruff nikhilwoodruff merged commit caafa1d into main Aug 6, 2025
3 checks passed
@nikhilwoodruff nikhilwoodruff deleted the functions branch August 6, 2025 10:07
Labels

enhancement New feature or request

2 participants